Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


files, we can use “-V” for each one. Instead of repeating the “-V” option for multiple
gVCF files, the sample information can be saved in a text file called a cohort sample map file.
The file can then be passed to the “--sample-name-map” option. The cohort sample map file is
a plain text file that contains two tab-separated columns: the first column holds the sample
IDs and the second column holds the names of the gVCF files. Each sample ID is mapped
to one gVCF file name, as shown in Figure 4.6.
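The two-column layout described above can be checked before the file is handed to GATK. The following is a minimal sketch, not part of the original workflow: the helper name check_sample_map, the sample IDs, and the paths are all hypothetical and serve only as an illustration.

```shell
# Hypothetical helper: succeed only if every line of the file has exactly
# two non-empty, tab-separated fields (sample ID and gVCF file name).
check_sample_map() {
    awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" { bad = 1 } END { exit bad }' "$1"
}

# Example with made-up sample IDs and paths (assumptions for illustration).
map=$(mktemp)
printf 'sampleA\t/data/gvcf/sampleA_chr21.dedup.RG.bqsr.g.vcf.gz\n' > "$map"
printf 'sampleB\t/data/gvcf/sampleB_chr21.dedup.RG.bqsr.g.vcf.gz\n' >> "$map"
check_sample_map "$map" && echo "sample map looks valid"
```

A line that uses spaces instead of a tab between the two columns makes the check fail, which is a common cause of a malformed map.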

The cohort sample map file can be created manually by the user. However, we can also
use a Bash script to create it. The following script creates a cohort sample map file for our 13
samples; the resulting file will look like the one shown in Figure 4.6:

cd gvcf
# a- list each file name followed by its absolute path
find "$PWD"/*_chr21.dedup.RG.bqsr.g.vcf.gz -type f -printf '%f %h/%f\n' > ../tmp.txt
# b- replace the file-name suffix in the first column with a comma
awk '{ gsub(/_chr21\.dedup\.RG\.bqsr\.g\.vcf\.gz/, ",", $1); print }' ../tmp.txt > ../tmp2.txt
rm ../tmp.txt
# remove whitespace
sed -r 's/\s+//g' ../tmp2.txt > ../tmp3.txt
rm ../tmp2.txt
# replace comma with tab
sed -e 's/,\+/\t/g' ../tmp3.txt > ../cohort.sample_map
rm ../tmp3.txt
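The same map can also be produced in one pass without temporary files. The sketch below is an alternative to the script above, not the author's method: the function name make_sample_map is hypothetical, and it assumes the gVCF files share the suffix _chr21.dedup.RG.bqsr.g.vcf.gz used in this chapter.

```shell
# Hypothetical helper: write "<sample ID><TAB><absolute gVCF path>" lines.
#   $1 = directory holding the gVCF files
#   $2 = output sample map file
make_sample_map() {
    gvcf_dir=$(cd "$1" && pwd)               # absolute path of the gVCF directory
    suffix="_chr21.dedup.RG.bqsr.g.vcf.gz"   # suffix assumed from this chapter
    for f in "$gvcf_dir"/*"$suffix"; do
        [ -e "$f" ] || continue              # skip if the glob matched nothing
        name=$(basename "$f" "$suffix")      # strip the directory and the suffix
        printf '%s\t%s\n' "$name" "$f"
    done > "$2"
}
```

Run from the directory that contains gvcf, a call such as make_sample_map gvcf cohort.sample_map would write the map next to the gvcf directory. Because the sample ID and path are written with an explicit tab, the comma placeholder and whitespace clean-up steps are not needed.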

Once we have created the cohort sample map file, we can run the GenomicsDBImport tool
to import the gVCF sample files and the GenotypeGVCFs tool to consolidate the variants of the 13
samples into a single VCF file.

FIGURE 4.6 Cohort sample map file.

#create a database
ref=$(ls ../refgenome/*.fasta)
~/software/gatk-4.2.3.0/gatk \
--java-options -Xmx10g \
GenomicsDBImport \